I chose the Red Wine data set for my EDA project. The goal of this exploratory analysis is to determine which chemical characteristics in the data set have the greatest effect on the quality of the wine.
#To gain an initial overview of the dataset, I have used the 'dim' and 'str' functions.
dim(redwine)
## [1] 1599 13
str(redwine)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
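The variable ‘quality’ will be converted to an ordered factor variable, ‘quality.f’, and used as the output variable. The index variable ‘X’ is meant to be set to NULL. Note that R names are case-sensitive, so the lowercase assignment below leaves ‘X’ in place, which is why it still appears in the later output.
redwine$x <- NULL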
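A minimal sketch of the two preparation steps described above, run on the first ten rows printed by `str`. The explicit level set `3:9` is an assumption chosen to match the `levels()` output shown in this report, and dropping the column by its capitalized name `X` is one way to remove the index, not necessarily what the original code did.

```r
# First ten rows as printed by str() above (only the columns needed here).
redwine <- data.frame(
  X       = 1:10,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# R names are case-sensitive: this targets a column "x" that does not exist,
# so the data frame is left unchanged.
redwine$x <- NULL
"X" %in% names(redwine)   # still TRUE

# Dropping the index column by its actual name.
redwine$X <- NULL

# Ordered factor version of quality. The 3:9 level set is an assumption that
# mirrors the levels() output in this report.
redwine$quality.f <- factor(redwine$quality, levels = 3:9, ordered = TRUE)
levels(redwine$quality.f)
```

Because the factor is ordered, comparisons such as `redwine$quality.f >= "6"` work directly on the scores.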
#The variable names in the dataset are as follows:
names(redwine)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "quality.f"
#Quality is an output variable scored by a test panel of wine experts, with possible values of 0 to 10, 10 being the best. The factor levels for this dataset range from "3" through "9".
levels(redwine$quality.f)
## [1] "3" "4" "5" "6" "7" "8" "9"
Next, I wanted to look at the maximum, minimum and quartile values for each variable. For each variable, the summary is accompanied by a boxplot overlaying a jittered scatterplot to make the distribution easier to read and to highlight its outliers. This section follows the same order as the names listed above.
The first variable is ‘fixed.acidity’. Fixed acidity is measured in grams of tartaric acid per liter of wine. The summary and the plots show that, for a large majority of red wines in the dataset, the fixed acidity level is between 6 and 11 g/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
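A minimal sketch of how a summary like the one above and its accompanying plot could be produced in base R. The report's own plotting code is not shown, so the `stripchart` and `boxplot` pairing is an assumption, and the ten values are only the first entries printed by `str` earlier, not the full column.

```r
# First ten fixed.acidity values as printed by str() earlier; the full column
# has 1,599 values, so this summary will differ from the one in the report.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# Minimum, quartiles, median, mean and maximum in one call.
acidity_summary <- summary(fixed_acidity)
print(acidity_summary)

# Jittered points first, then a boxplot outline drawn over them.
stripchart(fixed_acidity, method = "jitter", vertical = TRUE,
           pch = 16, col = "grey50", ylab = "fixed.acidity (g/L)")
boxplot(fixed_acidity, add = TRUE, axes = FALSE, col = "transparent",
        border = "red", outline = FALSE)
```

Drawing the points first keeps individual observations visible, and `outline = FALSE` avoids plotting the outliers a second time on top of the jittered points.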
The next variable is volatile acidity. Volatile acidity is the amount of acetic acid in grams per liter of wine. High levels of this compound contribute to an unpleasant taste. The median and mean are close to each other, which is consistent with a roughly symmetric distribution, and the plots show that most red wines in the dataset have a volatile acidity between 0.3 and 0.6 g/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
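A mean close to the median suggests symmetry but does not by itself establish normality. The sketch below shows one way to put a number on the asymmetry with a moment-based skewness. The helper `sample_skewness` is a hypothetical name, not part of the original analysis, and the ten values are only the first entries printed by `str`.

```r
# Moment-based sample skewness: 0 for a symmetric sample, positive when the
# right tail is longer, negative when the left tail is longer.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  deviations <- x - mean(x)
  mean(deviations^3) / mean(deviations^2)^1.5
}

# First ten volatile.acidity values as printed by str() earlier (illustrative
# only; the full column has 1,599 values).
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

c(mean     = mean(volatile_acidity),
  median   = median(volatile_acidity),
  skewness = sample_skewness(volatile_acidity))
```

A normal quantile plot (`qqnorm` followed by `qqline`) is a common visual companion to this kind of check.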
The third of the eleven variables in this section is citric acid. Citric acid, measured in g/L, adds “freshness” and flavor to wines. The histogram in particular highlights that citric acid may be a strong differentiating factor in these wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Residual sugar is the fourth variable. Residual sugar, measured in g/L, is the sugar left over after fermentation.
summary(redwine$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Chlorides is the next variable. Chlorides are measured by the amount of sodium chloride per liter.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Free sulfur dioxide is the next variable. Free sulfur dioxide, the portion of sulfur dioxide not bound to other compounds, is measured in mg per liter.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Total sulfur dioxide is the next variable. Total sulfur dioxide is the sum of the free and bound forms of sulfur dioxide. Sulfur dioxide is used in wine to prevent microbial growth and oxidation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Density is the next variable. Density, measured in g/mL, is related to residual sugar and alcohol content and appears normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
pH is the next variable. Most wines have a pH between 3 and 4, and the variable appears normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Sulphates is the next variable. Sulphates, measured as the amount of potassium sulphate per liter, are also used as an antimicrobial agent in wine and contribute to the amount of sulfur dioxide in wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The last variable in this section is alcohol. Alcohol is measured in percent by volume. It is interesting to see the significant peak in wines with an alcohol level of approximately 9.2%.
Next, I wanted to view the distribution of Quality across the overall dataset, presented below.
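A minimal base-R sketch of how the Quality distribution could be tabulated and drawn. The original plotting code is not shown, so the use of `table` and `barplot` is an assumption, and the ten scores are only the first entries printed by `str`.

```r
# First ten quality scores as printed by str() earlier (illustrative only).
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Fixing the levels keeps empty scores in the table instead of dropping them.
quality_counts <- table(factor(quality, levels = 3:9))
print(quality_counts)

# Bar chart of the counts, one bar per score.
barplot(quality_counts, xlab = "quality", ylab = "number of wines")
```

A bar chart suits this variable better than a histogram with arbitrary bins because the scores are discrete.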
Now, with some understanding of the Quality distribution, I wanted to dig a bit deeper into each variable. I noted from the summary information that some of the variables appeared to have a normal distribution while others did not. Digging further, I wanted to see whether transforming some of the variables would give a more revealing distribution and make them potentially more useful for regression analysis.
Residual sugar is not normally distributed.
The variable Free Sulfur Dioxide shows a right-skewed, long-tailed distribution. Transforming the data with log10 gives a more nearly normal distribution. As would be expected, Total Sulfur Dioxide has a similar distribution. The ratio of Free Sulfur Dioxide to Total Sulfur Dioxide has a normal distribution.
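The log10 transformation and the free-to-total ratio described above can be sketched as follows. The column name `free_to_total_ratio` appears in the correlation table later in this report. `log10_free` is a hypothetical name used only for this illustration, and the data are the first ten rows printed by `str`.

```r
# First ten rows as printed by str() earlier (illustrative only).
redwine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# log10 compresses the long right tail; it requires strictly positive values.
redwine$log10_free <- log10(redwine$free.sulfur.dioxide)

# Share of the total sulfur dioxide that is free, between 0 and 1.
redwine$free_to_total_ratio <-
  redwine$free.sulfur.dioxide / redwine$total.sulfur.dioxide

# Side-by-side histograms: raw, log10-transformed, and ratio.
old_par <- par(mfrow = c(1, 3))
hist(redwine$free.sulfur.dioxide, main = "raw", xlab = "free SO2 (mg/L)")
hist(redwine$log10_free, main = "log10", xlab = "log10(free SO2)")
hist(redwine$free_to_total_ratio, main = "ratio", xlab = "free / total")
par(old_par)
```

Comparing the three panels side by side makes it easier to judge whether a transformation actually brings the shape closer to a bell curve.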
Density shows a normal distribution, and transformation showed the same normal distribution.
pH shows a normal distribution.
Transformation of Sulphates shows a slightly better defined distribution.
Alcohol does not show a normal distribution, and transformation did not show any significant insight.
#Univariate Analysis
The resulting “redwine” dataset has 1,599 observations and 13 variables and was downloaded from the Udacity project site. For this analysis the ‘x’ variable was coded null and the ‘quality’ int variable was used as an output variable by introducing a new factored variable ‘quality.f’.
The main features are the quality and alcohol variables. Quality values were determined by a panel of wine experts. The Quality histogram shows that most wines were classified as 5, 6 and 7. None of the wines were classified as 1, 2 or 10.
Intuitively, one would want to consider the levels of all components of wine and their effect on quality. The boxplots by quality, juxtaposed with the summary information for each variable, provide some interesting insight into their effect on quality. Review of these boxplots indicates that alcohol, citric acid and sulphates have a discernible positive effect on quality, while volatile acidity shows a negative effect on quality. Some of the features in the data set are related and will be discussed in the bivariate section.
I made an ordered factor variable out of “quality”. I also created a ratio between the free sulfur dioxide and total sulfur dioxide to see how the distribution of the ratio differed from that of the individual features. One could calculate a similar ratio of citric acid to fixed acidity.
A lot of the features didn’t have a normal distribution, and transforming them created distributions that approach the normal curve, though not completely. Some didn’t change at all.
Fixed acidity has a long tail, so a transformation was applied. The resulting histogram is more normally distributed.
Volatile acidity is skewed, with a long right tail, and the log10 transformation revealed the bimodal character of the distribution.
Residual sugar is not normally distributed. Transformation using log10 yielded something like a bimodal distribution.
Chlorides also don’t look normally distributed. Transformation made it look better, but it also revealed a bimodal distribution.
Free sulfur dioxide and total sulfur dioxide had non-normal distribution and transformation didn’t do anything. However, the ratio of the two variables provided a normal distribution.
Transformation of density also didn’t make the distribution better.
Alcohol distribution isn’t normal and transformation didn’t change the distribution.
Regarding tidying the data, the dataset provided on the Udacity site had no missing values. The only adjustment was to factor and order the Quality variable.
The initial bivariate relationship I wanted to look at was how each variable relates to the factored Quality variable. The next set of plots is presented for this comparison. The boxplots are shown with the mean value of the variable highlighted for each quality score. Again, these plots are shown in the same name order as in the univariate section.
The first variable comparison is with fixed acidity. Visually, the fixed acidity shows minimal fluctuation across the quality spectrum.
create_plot('fixed.acidity')
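The body of `create_plot` is not shown in this report. The following is a hypothetical reconstruction, assuming ggplot2 and the `quality.f` factor: a boxplot per quality score with the group mean marked. The `y_limits` argument is an assumption added to show one way of keeping control of the y axis from inside a function, and the data are the first ten rows printed by `str`.

```r
library(ggplot2)

# First ten rows as printed by str() earlier (illustrative only).
redwine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
redwine$quality.f <- factor(redwine$quality, levels = 3:9, ordered = TRUE)

# Hypothetical reconstruction: boxplot of one variable per quality score,
# with the group mean marked. `y_limits` is an assumed optional argument.
create_plot <- function(variable, data = redwine, y_limits = NULL) {
  ggplot(data, aes(x = quality.f, y = .data[[variable]])) +
    geom_boxplot() +
    stat_summary(fun = mean, geom = "point", shape = 4, colour = "red") +
    coord_cartesian(ylim = y_limits) +
    labs(x = "quality", y = variable)
}

fixed_acidity_plot <- create_plot("fixed.acidity")
zoomed_plot <- create_plot("fixed.acidity", y_limits = c(6, 12))
```

Passing the column name as a string and looking it up with `.data[[variable]]` lets one function serve every variable, and `coord_cartesian` zooms the axis without discarding the points outside the limits. Printing either object, for example `print(zoomed_plot)`, draws it.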
Next, we have volatile acidity plotted by quality score. There is a clear indication that higher quality wines have lower levels of volatile acidity. This is not surprising since, as we noted earlier, volatile acidity contributes to the unpleasant taste in wine.
Next, we look at citric acid relative to quality. There is a strong positive relationship between increased levels of citric acid and quality. This is likely ‘the grapes shining through and showing their distinctness in the wine’.
Here we have residual sugar vs quality score.
The next relationship is with chlorides.
Free sulfur dioxide relative to quality. Interestingly, the highest quality and lowest quality rated wines have similar and relatively lower free sulfur dioxide content.
Total sulfur dioxide relative to quality rating. The mean of this variable rises toward the middle quality ratings and falls at both ends, giving a bell-like shape across ratings.
Next is the wine's density relative to quality score. The mean density decreases as the quality score improves.
pH relative to quality score.
Next we look at sulphates relative to quality rating. There is a noticeable increase in the mean sulphate level as the quality rating improves. Sulphates are often added to wine to help preserve the wine's taste.
Finally, we look at alcohol relative to quality rating. There is a clear indication that the higher the alcohol level the higher the quality rating.
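The group means highlighted in the boxplots can also be read off as numbers. This is a small base-R sketch using `tapply`, with the first ten rows printed by `str` standing in for the full data set, so the values will not match the full-data plots.

```r
# First ten rows as printed by str() earlier (illustrative only).
redwine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Mean alcohol content within each quality score.
mean_alcohol <- tapply(redwine$alcohol, redwine$quality, mean)
print(round(mean_alcohol, 2))

# Number of wines behind each mean, to show how much weight it deserves.
wines_per_score <- table(redwine$quality)
print(wines_per_score)
```

Reporting the group sizes next to the means matters here because the extreme quality scores contain few wines.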
Building on insight from the previous review, I decided to group several of the variables at a time for analysis. The analysis below shows graphs and Pearson Correlation values using the ggpairs function.
The first group chart below supports the prior analysis in showing that both alcohol and citric acid have a positive correlation with quality.
library(GGally)  # provides ggpairs()
g1 <- subset(redwine, select = c("quality", "alcohol", "citric.acid"))
ggpairs(g1)
The next correlation chart below further supports the prior analysis by showing the negative correlation between quality and volatile acidity and the positive correlation between quality and sulphates.
g2 <- subset(redwine, select = c("quality", "volatile.acidity", "sulphates"))
ggpairs(g2)
As would be expected pH and citric acid have a strong negative correlation. Interestingly, sulphates and citric acid show a reasonably strong positive correlation.
g3 <- subset(redwine, select = c("pH", "citric.acid", "sulphates"))
ggpairs(g3)
The strong positive correlation between density and the variables residual sugar and fixed acidity was interesting.
g4 <- subset(redwine, select = c("density", "residual.sugar", "fixed.acidity"))
ggpairs(g4)
In order to look at the correlation amongst all of the variables, I used a table format as presented below:
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.26848392 -0.008815099
## fixed.acidity -0.268483920 1.00000000 -0.256130895
## volatile.acidity -0.008815099 -0.25613089 1.000000000
## residual.sugar -0.031260835 0.11477672 0.001917882
## chlorides -0.119868519 0.09370519 0.061297772
## free.sulfur.dioxide 0.090479643 -0.15379419 -0.010503827
## total.sulfur.dioxide -0.117849669 -0.11318144 0.076470005
## density -0.368372087 0.66804729 0.022026232
## pH 0.136005328 -0.68297819 0.234937294
## sulphates -0.125306999 0.18300566 -0.260986685
## free_to_total_ratio 0.335438942 -0.13081236 -0.072618561
## residual.sugar chlorides free.sulfur.dioxide
## X -0.031260835 -0.119868519 0.090479643
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## free_to_total_ratio -0.070626080 -0.105156413 0.327240869
## total.sulfur.dioxide density pH
## X -0.11784967 -0.36837209 0.13600533
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## free_to_total_ratio -0.37143493 -0.26497991 0.18489507
## sulphates free_to_total_ratio
## X -0.125306999 0.33543894
## fixed.acidity 0.183005664 -0.13081236
## volatile.acidity -0.260986685 -0.07261856
## residual.sugar 0.005527121 -0.07062608
## chlorides 0.371260481 -0.10515641
## free.sulfur.dioxide 0.051657572 0.32724087
## total.sulfur.dioxide 0.042946836 -0.37143493
## density 0.148506412 -0.26497991
## pH -0.196647602 0.18489507
## sulphates 1.000000000 -0.01045914
## free_to_total_ratio -0.010459139 1.00000000
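A table like the one above can be produced with `cor` on the numeric columns. The exact call used for this report is not shown, so the column selection below is an assumption, and the data are the first ten rows printed by `str`, so the coefficients will not match the full-data table.

```r
# First ten rows as printed by str() earlier (illustrative only).
redwine <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  pH                   = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39,
                           3.36, 3.35),
  quality.f            = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
                                levels = 3:9, ordered = TRUE)
)

# cor() needs numeric input, so factor columns such as quality.f are dropped.
numeric_columns <- redwine[sapply(redwine, is.numeric)]

# Pairwise Pearson correlations; rounding keeps the printout readable.
correlations <- cor(numeric_columns, method = "pearson")
print(round(correlations, 2))
```

Dropping an index column such as `X` before this step would keep row numbers out of the table, since a row number carries no chemical meaning.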
Using a matrix correlation on small groups of variables as I did above made it easier to pinpoint which input variables can contribute to classification of wine quality. These are alcohol, citric acid, sulphates (all positive correlation) and volatile acidity (negative correlation).
The strong positive correlation between density and the variables residual sugar and fixed acidity was interesting.
The strongest relationship between the output variable (quality) and the input variables is with alcohol content.
Among the other input variables, the strongest correlations are between density and fixed acidity; free and total sulfur dioxide; and pH and fixed acidity. The remaining variable pairs have weaker correlations.
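To rank the input variables by the strength of their relationship with quality, one option is to correlate each column with the numeric quality score and sort by absolute value. This is a sketch under that assumption, not the method used in the report. The data are the first ten rows printed by `str`, so the ranking it prints is illustrative only.

```r
# First ten rows as printed by str() earlier (illustrative only).
redwine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47,
                       0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation of every input column with the quality score.
input_names <- setdiff(names(redwine), "quality")
quality_correlations <- sapply(redwine[input_names], cor, y = redwine$quality)

# Strongest relationships first, regardless of sign.
ranked <- quality_correlations[order(abs(quality_correlations),
                                     decreasing = TRUE)]
print(round(ranked, 2))
```

Since quality is an ordinal score, passing `method = "spearman"` to `cor` is a reasonable alternative to the Pearson default.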
In this section, I wanted to focus on the correlations noted in the previous summary section, specifically the variables with the strongest relationships to the output variable quality. Additionally, to review any relationships that may have been overlooked in the univariate and bivariate analysis, I wanted to look at the strongest correlations between the other input variables from a multivariate approach.
The next several plots look at the four variables which show the strongest influence on quality: alcohol, volatile acidity, citric acid and sulphates.
The first relationship presented is between Alcohol and Volatile Acidity relative to Quality. The most striking feature in this plot is in Quality category 8, which shows a strong positive correlation between alcohol and volatile acidity. It was previously stated that volatile acidity contributes to the unpleasant taste of a wine, and later shown that an increase in volatile acidity generally has a negative effect on quality rating. It was also shown that increased alcohol content often yields a higher quality rating. Thus, the category 8 feature may indicate that alcohol mitigates the unpleasant taste.
Here I present alcohol and citric acid relative to quality. Again, quality category 8 stands out, showing a negative correlation between alcohol and citric acid.
Finally, let’s look at alcohol and sulphates relative to quality. Increased amounts of sulphates are associated with a higher quality rating. As stated earlier, sulphates are often added to preserve the wine over its shelf life.
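The plots in this section are not accompanied by code. One way such a plot could be built, assuming ggplot2, is a scatterplot of two inputs colored by quality score with a linear fit per score. The helper `plot_by_quality` is a hypothetical name, and the data are the first ten rows printed by `str`.

```r
library(ggplot2)

# First ten rows as printed by str() earlier (illustrative only).
redwine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47,
                       0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
redwine$quality.f <- factor(redwine$quality, levels = 3:9, ordered = TRUE)

# Hypothetical helper: two inputs on the axes, quality score as colour,
# and one least-squares line per quality score.
plot_by_quality <- function(x_var, y_var, data = redwine) {
  ggplot(data, aes(x = .data[[x_var]], y = .data[[y_var]],
                   colour = quality.f)) +
    geom_point(alpha = 0.6) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(x = x_var, y = y_var, colour = "quality")
}

alcohol_sulphates_plot <- plot_by_quality("alcohol", "sulphates")
alcohol_volatile_plot  <- plot_by_quality("alcohol", "volatile.acidity")
```

Fitting a separate line per score makes features like the one noted for category 8 visible, though with few wines in a category the fitted slope is unstable and should be read with caution.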
Now, combining three variables to classify the quality of Red Wines.
For personal interest, I wanted to look at quality in relation to density and residual sugar. As a person who enjoys red wine occasionally, I have a personal preference for what I describe as a ‘red wine which is kinda heavy with a touch of sweetness’. ‘My Plot’ shows that I probably enjoy the outliers in the mid-quality range; so my palate may not be ‘refined’ but I’m happy with it.
The final two plots look at the strongest correlations between the other input variables: the group of density, fixed acidity and residual sugar seemed of interest, as did the group of fixed acidity, free sulfur dioxide and total sulfur dioxide.
The input variables alcohol content, sulphates and volatile acidity seem to be the main drivers of quality, as depicted in the scatter plots above. The most surprising part of this section of the analysis for me is that sulphates appeared to be potentially the strongest driver of quality, whereas earlier preliminary indications were that alcohol content and citric acid had a greater influence on quality.
I was surprised that density and residual sugar did not have a stronger influence on quality. Additionally, when I first noted how much alcohol influences quality I was suspicious, since the quality rating involved human scoring. Very interesting and surprising was the strong influence sulphates have on quality. I personally know some people with small vineyards and want to discuss the sulphate content of their soil.
Most of the Red Wines have a Quality rating between 5 and 6 and the dataset shows a normal distribution. This would be expected when using a human measurement for quality over a specified scale of 0 to 10.
Description Plot Two
Quality ratings for Red Wines increase with increased levels of Alcohol and Sulphates. These observations stand to reason on the following basis: 1) the quality rating is determined by a human panel of wine experts who would likely prefer a stronger alcohol content; 2) sulphates are often added to wine to preserve the taste over time and would likely be higher in the wines receiving a higher quality score.
Description Plot Three
The final plot shows that wine quality improves significantly with increased levels of citric acid and decreases significantly with greater levels of volatile acidity. Citric acid, measured in g/L, adds “freshness” and flavor to wines. Citric acid is the strongest natural differentiating factor in these wines; it reflects the specific grape shining through in the wine. Volatile acidity is the amount of acetic acid in grams per liter of wine. High levels of this compound contribute to an unpleasant taste. The median and mean values are close, consistent with a roughly symmetric distribution, and the plots show that most red wines in the dataset have a volatile acidity ranging from 0.3 to 0.6 g/L, with higher levels receiving a poorer quality score and lower levels receiving a higher quality score.
Reflections
When I started examining the dataset, I was excited by the numerous components of wine and wanted to determine which had the most differentiating effect. As I got into the analysis, it was interesting to see how concentrated or dispersed each component was across the dataset. The direction of the analysis was shaped by each variable's effect on the quality rating; this seemed to be the most natural approach given the variables. It may be interesting to expand the dataset to include a price variable and use price as the primary comparison variable. It may also be interesting, in an expanded dataset, to look at the age of the wine relative to components such as the sulphate content. It would also be interesting to examine the soil content where the grapes originated relative to their citric acid and volatile acidity content.
Some of the specific struggles during this project were very basic, and some stemmed from the conflict between imagination and the limits of my knowledge. One basic struggle was getting set up: I am using Windows 10 and could not get R and RStudio to work in a new project sub-directory, so, following the path of least resistance, I completed the work in the main directory and then copied the files to the new project directory to upload to Git. More relevant, I would imagine ways (plots) to present the relationships between variables, particularly in the bivariate and multivariate sections, and then could not get the code to work. This led me to present more basic representations at first. Specifically, my first review notes strongly suggested adding a regression line to my bivariate scatterplot analysis, and the reviewer provided a sample code snippet. It was then that I realized I had been making a minor syntax error, which had led me to present the project without the regression layer. In my second set of review notes, it was suggested that I create a function to avoid repetitive code, and a snippet was provided. The code ran, but the issue became that I lost control of my y axis when using the function. I tried several adjustments that I thought would address the problem, but they did not. I researched the issue with Google and finally posted the code and dilemma in the Udacity forum. As I am writing this (to address a review note), I still do not have this ‘struggle’ resolved; however, therein lies what I feel was my success in the project . . . perseverance! At each step in the project, I pretty much knew that there was a method, function or library in R to accomplish the juxtaposition of the data that I was imagining in order to see the data relationships and/or tell the story. As a footnote to this paragraph, a big thanks to Myles in the Udacity forum for helping solve my repetitive code issue noted above.
Thus, my two objectives were to learn some EDA and complete/pass this project; so if my third reviewer actually reads all of the project and gets to this paragraph then he/she will know what to do and I will call that Success ;) !
Exploratory Data Analysis is a process. As in real-life exploring, the process goes one step at a time, with knowledge building on itself and suggesting the next step. Another similarity is that the process leads not to conclusions (as that is not the purpose) but to insights. Finally, while there may be no definitive end to exploration, one concludes exploration ‘by letting the experience or, in this case, the data speak for itself’.
References
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Factor variables http://statistics.ats.ucla.edu/stat/r/modules/factor_variables.htm
Pair Plots http://sas-and-r.blogspot.com/2011/12/example-917-much-better-pairs-plots.html
Adding and removing columns from a data frame http://www.cookbook-r.com/Manipulating_data/Adding_and_removing_columns_from_a_data_frame/
Exploratory data analysis and data http://r4ds.had.co.nz/r-markdown.html#text-formatting-with-markdown
Exploratory Data Analysis on Wine Quality by Bilal Mahmood https://rpubs.com/Bilal_Mahmood/EDA
Wine Quality Analysis: http://rstudio-pubs-static.s3.amazonaws.com/24803_abbae17a5e154b259f6f9225da6dade0.html
Wine Quality Analysis https://github.com/mudspringhiker/exploratory_data_analysis_wines_using_R/blob/master/eda_wines_varshal.Rmd
Correlation matrix http://www.cookbook-r.com/Graphs/Correlation_matrix/
An introduction to corrplot package https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
Diamonds exploration by Chris Saden: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/diamondsExample.html
Knitr with R Markdown http://kbroman.org/knitr_knutshell/pages/Rmarkdown.html